feat: poll and print SLURM job estimated start time while pending #464
Merged
When a SLURM job is submitted and sits in the queue, there is no feedback about when it is expected to start. This adds a background daemon thread per job that polls `squeue --start` every 30 seconds and prints the estimated start time to stdout, stopping automatically once the job leaves the pending queue.

Key details:
- `_poll_job_start_time`: new method guards against None stdout, non-zero return codes, and array-job multi-line output (prints only first line)
- Thread is started in `schedule()` and stopped in `_cancel_existing()` and `close()`; duplicate job_id (retry) stops the old thread first
- 11 new TDD tests cover all edge cases from the plan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
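The thread lifecycle described in this commit message (start in `schedule()`, stop in `_cancel_existing()` and `close()`, stop the old thread when a job_id is reused) could look roughly like the sketch below. `PollerRegistry` and its method names are hypothetical and exist only for illustration; the PR keeps this state on the scheduler itself, and its actual attribute names are not shown here.

```python
import threading


class PollerRegistry:
    """Hypothetical helper: tracks one polling thread and stop event per job."""

    def __init__(self, poll_fn):
        # poll_fn is assumed to have the shape poll_fn(job_id, stop_event).
        self._poll_fn = poll_fn
        self._threads = {}
        self._stop_events = {}

    def start(self, job_id):
        # A retry can reuse a job_id, so stop any older poller first.
        self.stop(job_id)
        stop_event = threading.Event()
        thread = threading.Thread(
            target=self._poll_fn, args=(job_id, stop_event), daemon=True
        )
        self._stop_events[job_id] = stop_event
        self._threads[job_id] = thread
        thread.start()

    def stop(self, job_id):
        # Signal the poller (if any) and forget it; no-op for unknown jobs.
        stop_event = self._stop_events.pop(job_id, None)
        self._threads.pop(job_id, None)
        if stop_event is not None:
            stop_event.set()

    def close(self):
        # Signal every poller and clear the tracking dicts.
        for job_id in list(self._stop_events):
            self.stop(job_id)
```

Daemon threads keep a forgotten poller from blocking interpreter exit, while the per-job `threading.Event` lets cancellation wake a sleeping poller immediately, provided the poller sleeps with `stop_event.wait(...)` rather than `time.sleep(...)`.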
chtruong814
previously approved these changes
Mar 15, 2026
hemildesai
previously approved these changes
Mar 15, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Replace fixed 30s interval with exponential backoff (30s base, 2x factor, capped at 15min) to reduce unnecessary polling for long-pending jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
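As a rough illustration of the backoff schedule described in this commit (30s base, 2x factor, 15min cap), here is a minimal sketch. The generator name `backoff_delays` and its parameter names are assumptions made for this example, not identifiers from the PR.

```python
import itertools


def backoff_delays(base=30.0, factor=2.0, cap=900.0):
    """Yield poll intervals in seconds: base, base*factor, ... capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        # Clamp the stored value too, so it never grows without bound.
        delay = min(delay * factor, cap)


# First seven intervals: 30, 60, 120, 240, 480, 900, 900 seconds.
first_delays = list(itertools.islice(backoff_delays(), 7))
print(first_delays)
```

A polling loop would typically pass each delay to `stop_event.wait(delay)`, which returns early when the job is cancelled, instead of sleeping unconditionally.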
malay-nagda
approved these changes
Mar 16, 2026
Summary
When a SLURM job is submitted, it may sit in the pending queue for minutes or hours with no feedback. This PR adds a lightweight background daemon thread that polls `squeue --start` every 30 seconds and prints the estimated start time to stdout, stopping automatically once the job leaves the pending queue.

- `_poll_job_start_time`: new method on `SlurmTunnelScheduler` that runs `squeue --start --noheader -j {job_id} -o '%i|%S|%T'` in a loop, printing `[SLURM] Job {id} - State: PENDING, Estimated start: <time>` until the job starts or the stop event is set
- `schedule()`: starts a daemon thread after `_save_job_dir`; stops any pre-existing thread for the same job_id first (retry/duplicate case)
- `_cancel_existing()`: signals the polling thread when a job is cancelled
- `close()`: replaces the `...` stub; signals all polling threads and clears the tracking dicts

Edge cases handled:

- `stdout=None` guard (`result.stdout or ""`)
- Non-zero `return_code` treated as empty (SLURM error text in stdout)
- Array-job output (`12345_1`, `12345_2`) deduplicated: only first line printed per cycle
- No `--me` flag: uses `-j {job_id}` only

Test plan

- 11 new tests (`test/run/torchx_backend/schedulers/test_slurm.py`)
- `test_schedule*` tests updated to patch `_poll_job_start_time` (prevents polling thread from interfering with `tunnel.run.assert_called_once()`)
- `uv run -- pytest test/run/torchx_backend/schedulers/test_slurm.py -v`
- `ruff check` + `ruff format` clean

🤖 Generated with Claude Code
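The edge cases listed in the summary can be sketched as a small pure function that turns one `squeue` result into the line to print. This is not the PR's implementation: `parse_squeue_start` and `format_status` are hypothetical names, and the only things taken from the description are the `%i|%S|%T` output format, the `result.stdout or ""` guard, the non-zero return code rule, and the first-line-only rule for array jobs.

```python
def parse_squeue_start(stdout, return_code):
    """Return (job_id, start_time, state) from '%i|%S|%T' output, or None."""
    # Non-zero return code: stdout may hold SLURM error text, so ignore it.
    if return_code != 0:
        return None
    # stdout can be None; treat that the same as empty output.
    lines = [ln.strip() for ln in (stdout or "").splitlines() if ln.strip()]
    if not lines:
        return None
    # Array jobs yield one line per task; only the first one is used.
    parts = [p.strip() for p in lines[0].split("|")]
    if len(parts) != 3:
        return None
    return parts[0], parts[1], parts[2]


def format_status(parsed):
    """Build the message in the format quoted in the PR description."""
    job_id, start_time, state = parsed
    return f"[SLURM] Job {job_id} - State: {state}, Estimated start: {start_time}"
```

Keeping the parsing separate from the subprocess call and the sleep makes each edge case testable without patching threads or timers.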